N-Omniglot: a Large-scale Neuromorphic Dataset for Spatio-Temporal Sparse Few-shot Learning

arXiv:2112.13230v2 [cs.NE] 28 Dec 2021

Few-shot learning (learning from a few samples) is one of the most important capacities of the human brain. However, current artificial intelligence systems have difficulty achieving this ability, and so do the biologically plausible spiking neural networks (SNNs). Datasets for traditional few-shot learning domains provide little temporal information, and the absence of neuromorphic datasets has hindered the development of few-shot learning for SNNs. Here, we provide the first neuromorphic dataset for few-shot learning, N-Omniglot, recorded with a Dynamic Vision Sensor (DVS). It contains 1623 categories of handwritten characters, with only 20 samples per class. N-Omniglot meets the need for a neuromorphic dataset for SNNs with high sparseness and strong temporal coherence. Additionally, owing to the chronological information of the strokes, the dataset provides a demanding challenge and a suitable benchmark for developing SNN algorithms in the few-shot learning domain. We also provide improved spiking versions of the nearest neighbor, convolutional network, SiameseNet, and meta-learning algorithms for verification.

Background & Summary
In recent years, large-scale datasets and increased computing power have allowed machine learning, especially deep learning, to reach human-level performance in many areas [1-3]. However, compared with the human brain, artificial neural networks (ANNs) lack biological plausibility and interpretability because of their floating-point computation and gradient-based algorithms [4]. Combining computer technology with knowledge from computational neuroscience can effectively improve current deep learning technology. Spiking neural networks (SNNs) are considered the third generation of artificial neural networks [5]. They simulate calculations and representations similar to those of the human brain and therefore show strong biological interpretability. Only neurons that fire spikes participate in the computation of the network, and this sparse spike activity greatly reduces the network's energy consumption [6]. However, the lack of datasets for SNNs hampers the development of SNN algorithms.

The success of deep learning can largely be attributed to the introduction of datasets such as ImageNet [7] and COCO [8]. However, the currently widely used datasets are not suitable for SNNs. SNNs need to encode static data into spike trains before feeding them into the network [9]. As a result, information is lost, and the comparison with artificial neural networks is not fair. The Dynamic Vision Sensor (DVS) [10] is a neuromorphic camera. Unlike frame-based cameras, a DVS only generates 0/1 events at pixels where the light intensity changes, which yields low latency, low redundancy, and high temporal resolution. In addition, the DVS mimics the human visual nervous system in principle, so SNNs can make full use of the temporal information provided by such sensors.

To promote the development of SNNs, researchers have used DVS to build many neuromorphic datasets. N-MNIST [11], N-Caltech101, and DVS-CIFAR10 [12] are captured from traditional classification datasets by moving the picture or the camera, which avoids the damage to the original image information caused by an encoding process. In addition to converting static images, more researchers tend to obtain event data from natural environments, as in DVS-Gesture [13] and N-Cars [14]. However, the existing datasets, including those mentioned above, have very low temporal correlation and cannot fully reflect the temporal information processing capabilities of SNNs. Their characteristics are listed in Table 1.

Table 1. The characteristics of the neuromorphic datasets.

Dataset        # of classes   Sparsity   Difference   Object
N-MNIST        10             Low        Low          Image
N-Caltech101   100            Low        Low          Image
DVS-CIFAR10    10             Low        Low          Image
N-Cars         2              Low        Mid          Cars
DVS-Gesture    11             Low        Mid          Gesture
N-Omniglot     1623           High       High         Stroke

In addition to sparse coding [15], which reduces energy consumption, learning new concepts rapidly from a few samples is another important capability of the human brain, yet it remains an open problem in spike-based machine learning. Few-shot learning [16,17] poses tremendous challenges to the current learning methodologies of SNNs because of the lack of neuromorphic datasets [18] for training and evaluating the ability to learn from a few samples. To tackle these problems and fill this gap, we propose the first neuromorphic dataset for few-shot learning: N-Omniglot.

The original Omniglot dataset [19] is the most commonly used dataset in the field of few-shot learning. It consists of 1623 handwritten characters from 50 different languages. Each character has only 20 different samples. It is usually treated as a set of static character images, while the rich temporal information of the writing process is ignored. Therefore, we reconstruct the writing process of the strokes and use a DVS to record the events, obtaining the neuromorphic version of Omniglot (N-Omniglot).

Figure 1. Some examples of N-Omniglot. High temporal correlation and spatio-temporal sparsity can be seen in these examples.
We provide several classic few-shot learning algorithms adapted to SNNs, showing that N-Omniglot is diverse in the time dimension, provides more temporal information, and supports many tasks. We hope it can serve as a benchmark for SNN-based few-shot learning and offer a competitive environment for the research community to advance the temporal and spatial feature extraction ability and the sparse representation learning of SNNs.

Methods
In this work, we first use the stroke temporal information to reconstruct the writing process of Omniglot as videos. For convenience of capture, we merge the writing strokes of the 20 samples of each character into one video file, with a blank sequence inserted between samples. Second, we use the DVS acquisition platform to record the videos played on a monitor, and use Robotic Process Automation (RPA) software to collect the data automatically. Finally, the event data corresponding to each sample are split out. Figure 2 shows the entire construction process of the dataset.

Figure 2. Complete process of data generation. Phase A constructs the video for recording, including a1: preprocessing the original data, and a2: reconstructing the video. Phase B is the actual capture stage, including b1: building the equipment environment, and b2: recording with the RPA software. Phase C performs post-processing, including c1: labeling the beginning and end of the characters, and c2: segmentation using timestamps.

Stroke preprocessing and reconstruction
Each image in Omniglot has corresponding stroke data with timestamps in milliseconds. To present the entire writing process to the DVS, we first reconstruct the text record of the strokes as a video of the writing tracks. Because acquisition equipment and writing habits differ, we delete the time intervals that arise between strokes. Linear interpolation is used to fill in the data at millisecond resolution so that the character writing is reconstructed as accurately as possible. Because of the non-constant sampling frequency and the jitter during writing, some strokes have only one or a few points. Since the refresh rate of the display is 60 Hz, strokes shorter than 17 ms may not be displayed in the reconstructed video. Therefore, we linearly interpolate them to a sufficiently long duration, for which we use 34 ms.
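The interpolation step described above can be illustrated with a minimal sketch. It is not the authors' code: the function name resample_stroke, the (x, y, t) array layout, and the choice to stretch short strokes by rescaling their timestamps are assumptions made for illustration; only the 17 ms and 34 ms values come from the text.

```python
import numpy as np

DISPLAY_FRAME_MS = 17  # one frame at 60 Hz lasts about 16.7 ms (value from the text)
MIN_STROKE_MS = 34     # duration to which short strokes are stretched (from the text)


def resample_stroke(points, min_ms=MIN_STROKE_MS, frame_ms=DISPLAY_FRAME_MS):
    """Resample one stroke to one point per millisecond.

    points: array of shape (N, 3) with columns x, y, t (t in ms, non-decreasing).
    Returns an array of shape (M, 3) with columns x, y, t on a 1 ms grid.
    The layout and the stretching rule are assumptions of this sketch.
    """
    pts = np.asarray(points, dtype=float)
    x, y, t = pts[:, 0], pts[:, 1], pts[:, 2]
    t = t - t[0]                      # start the stroke at t = 0
    duration = t[-1]
    if duration < frame_ms:
        # Stroke too short to be shown on a 60 Hz display: stretch it.
        if duration > 0:
            t = t * (min_ms / duration)
        else:
            # Single point (or identical timestamps): spread points evenly.
            t = np.linspace(0.0, min_ms, len(t))
        duration = float(min_ms)
    grid = np.arange(0, int(round(duration)) + 1)   # 1 ms steps
    return np.stack([np.interp(grid, t, x), np.interp(grid, t, y), grid], axis=1)
```

For example, a two-point stroke lasting 10 ms is stretched to 34 ms and returned as 35 interpolated points, while a 100 ms stroke keeps its duration and is returned as 101 points.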

Automated capture using Davis346
We use the Davis346 as our acquisition device because of its good temporal resolution. We design a black box that covers the screen and the DVS camera to prevent changes in external light from interfering with data collection. In the experiment, the DV software is used to process the captured event data. We set the background activity time parameter in the DV software to 4000 to better filter out background noise from the input, such as the low-frequency noise displayed on the monitor. In addition, the exposure parameter is fixed to 8000 to keep the brightness stable. To avoid frequent manual software operations that would cause major changes in the relative position of the devices, we use the RPA software UiBot to collect and record the data automatically. As shown in Figure 2, we first read the address of the reconstructed stroke video. During the event conversion process, the record and stop buttons are pressed at the beginning and end of the video playback. Finally, the recorded .aedat4 files are saved to the present directory.

Segmentation and preprocess for usage
Each character in Omniglot contains 20 samples. To avoid unnecessary software operations and make the collection of DVS event data more stable and efficient, we combine the 20 reconstructed videos into one long video, with a 500 ms gap between samples as well as at the beginning and the end of the video. After converting the stroke video to the event file, we separate the events corresponding to each sample. Specifically, a sample starts when the events change from being sparsely scattered in space and time to being concentrated at a certain position for a long time. We then use the frame number of the original video to help find the corresponding time point that marks the end of the sample. We save the event data in the form (x, y, t, p), where x and y are the pixel coordinates of the event, t is its timestamp, and p is its polarity, with values 1 and 0 indicating an increase and a decrease of brightness, respectively.

In our paper, since SNNs are usually clock-driven at runtime and cannot perform asynchronous computation on all neuron units as FPGAs can, the event data are processed into image data before being fed into the network. We convert the events within a period of time into an image with the resolution of the Davis346 (346 × 260). The two polarities are represented by two image channels, and pixels without events are filled with 0. We mainly use two methods, the OR operation and the firing rate; the details are given in the Technical Validation section.
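As a rough illustration of the per-sample split, the sketch below cuts a long recording into samples once the begin and end timestamps of each sample are known. It does not reproduce the start-detection heuristic described above; the function name split_events and the (N, 4) array layout with columns x, y, t, p are assumptions of this sketch.

```python
import numpy as np


def split_events(events, windows):
    """Split one long recording into per-sample event arrays.

    events:  array of shape (N, 4) with columns x, y, t, p, sorted by t
             (this layout is an assumption of the sketch).
    windows: iterable of (t_begin, t_end) pairs, one per sample, in the
             same time unit as the t column.
    Returns a list with one (n_i, 4) array per window.
    """
    events = np.asarray(events)
    t = events[:, 2]
    samples = []
    for t_begin, t_end in windows:
        # Binary search for the first and last event inside the window.
        lo = np.searchsorted(t, t_begin, side="left")
        hi = np.searchsorted(t, t_end, side="right")
        samples.append(events[lo:hi])
    return samples
```

Because the events are sorted by timestamp, each window is located with two binary searches instead of a full scan, which matters for recordings with millions of events.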

Visual Analysis of N-Omniglot
To better analyze the properties of N-Omniglot, we calculate its statistical characteristics, as shown in Table 2. The maximum and minimum values of the horizontal and vertical coordinates are used to crop the images and remove unnecessary input. The average writing time of the strokes of each sample is 4 s, which is realistic. To better illustrate the difference between N-Omniglot and other neuromorphic datasets, we visualize some samples from N-MNIST, DVS-Gesture, DVS-CIFAR10, and N-Omniglot, as shown in Figure 3. We use the OR operation to compress all event data into 12 image frames and select 3 of the processed frames. It can be seen that for the first three neuromorphic datasets, the three frames are very similar and the activity is very dense, so they differ little from static-image input. In contrast, N-Omniglot shows large differences between frames and very sparse activity, which poses a greater challenge for building high-performance spatio-temporal information processing algorithms, and the temporal information of the stroke order is crucial for the recognition and generation of characters. Notably, this also makes the dataset a good benchmark for few-shot generation tasks.

Because the data are collected using the DV software, they are stored as aedat4 files. In addition, to minimize extra operations during acquisition, we combine 20 samples into one video and segment them using the method described above. We save the beginning and ending timestamps of each sample in a CSV file. To better show the progress of our data compared with previous data, we display the event data on the homepage. We separate the files into different directories to facilitate maintenance and use while maintaining the same structure as Omniglot. As shown in the Figure

Technical Validation
We note that SNNs are still in their infancy in the field of neuromorphic few-shot learning, and there is almost no suitable algorithm to support this task. To demonstrate the effectiveness of N-Omniglot and its potential to pose new challenges for the training of SNN algorithms, we conduct experiments on four SNN algorithms, including two general classic pattern classification methods and two few-shot learning algorithms. To show the difference between N-Omniglot and the encoded Omniglot, all experiments are performed on both of them. We then compare the encoded static dataset with the neuromorphic dataset.

Encoding and Preprocessing
Static images do not contain temporal information. Therefore, to match the way SNNs process spatio-temporal information, static images are usually converted into spike trains. As shown in Figure 5, event data recorded by a DVS are already spikes and do not need to be encoded. However, neuromorphic datasets are acquired by a DVS with high temporal resolution, and the excessively long timeline is a huge burden for current clock-driven SNN algorithms. Therefore, datasets captured by a DVS need to be merged along the temporal dimension. In our experiment, we divide the event stream into equal time intervals and merge the events within each interval in two ways, the OR operation and the spike firing rate, as shown in Figure 5 (b) and (d).
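The two merging schemes can be sketched as follows. This is an illustrative reading, not the released preprocessing code: the function name events_to_frames and the (N, 4) event layout are assumptions, and the "rate" mode is taken here to mean per-pixel event counts normalised by the largest count, which may differ from the exact definition used in the experiments.

```python
import numpy as np


def events_to_frames(events, num_frames, height=260, width=346, mode="or"):
    """Merge (x, y, t, p) events into num_frames two-channel frames.

    events: array of shape (N, 4) with columns x, y, t, p (p in {0, 1}).
    mode:   "or"   -> a pixel is 1 if at least one event fell on it,
            "rate" -> event count divided by the largest count (assumed
                      normalisation, see the lead-in).
    Returns an array of shape (num_frames, 2, height, width).
    """
    events = np.asarray(events)
    frames = np.zeros((num_frames, 2, height, width), dtype=np.float32)
    if len(events) == 0:
        return frames
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2].astype(float)
    p = events[:, 3].astype(int)
    # Assign every event to one of num_frames equal time intervals.
    span = max(t.max() - t.min(), 1e-9)
    idx = ((t - t.min()) / span * num_frames).astype(int)
    idx = np.minimum(idx, num_frames - 1)
    # Unbuffered accumulation so repeated pixels are counted correctly.
    np.add.at(frames, (idx, p, y, x), 1.0)
    if mode == "or":
        return (frames > 0).astype(np.float32)
    if mode == "rate":
        peak = frames.max()
        return frames / peak if peak > 0 else frames
    raise ValueError("mode must be 'or' or 'rate'")
```

The OR frames are binary and can be fed to the network as spikes directly, whereas the rate frames are floating-point values.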

Nearest Neighbor
As a classical pattern classification method, the nearest neighbor [23,24] (NN) method can evaluate the separability of samples to a certain extent and provide a baseline against which other algorithms can be compared. As shown in Figure 6 (c), the NN method compares the input sample with each sample in the training set and finds the one closest to the input according to a given distance function. The category of the input sample is then taken from this nearest neighbor. In the experiment, we use the Euclidean distance between samples as the distance function.
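A minimal sketch of nearest neighbor classification with the Euclidean distance is given below. The function name nearest_neighbor_predict and the flattening of each sample into a vector are assumptions made for illustration.

```python
import numpy as np


def nearest_neighbor_predict(train_x, train_y, query_x):
    """Label each query sample with the class of its closest training sample.

    train_x: array-like of shape (n_train, ...), e.g. stacks of event frames.
    train_y: array-like of n_train labels.
    query_x: array-like of shape (n_query, ...), same trailing shape as train_x.
    """
    train = np.asarray(train_x, dtype=float).reshape(len(train_x), -1)
    query = np.asarray(query_x, dtype=float).reshape(len(query_x), -1)
    # Squared Euclidean distance between every query and every training
    # sample; the square root is omitted because it does not change argmin.
    d2 = ((query[:, None, :] - train[None, :, :]) ** 2).sum(axis=2)
    return np.asarray(train_y)[d2.argmin(axis=1)]
```

For large training sets the pairwise distance matrix should be computed in batches to limit memory use.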

Classification Directly
The greatest difficulty of few-shot learning lies in the large number of categories and the small number of samples per category. Even so, such tasks can be handled directly as general classification problems. Therefore, we construct a spiking convolutional neural network with LIF neurons to process the dataset. It consists of two convolutional layers and two fully connected layers. Each convolutional layer is followed by an average pooling layer with stride 2. Because spikes are non-differentiable, we use the approximate gradient method of STBP [25] for training. Although it is not specially designed for few-shot learning tasks, the algorithm can still obtain relatively good results. For comparison, we also design an ANN model, as shown in Figure 5 (a), which uses convolutional layers as a feature extractor, while Long Short-Term Memory (LSTM) [26], a model commonly used to extract temporal features, combines the information along the time dimension.
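To make the role of the LIF neuron and the approximate gradient concrete, the following sketch simulates a clock-driven LIF layer and a rectangular surrogate derivative of the kind used in STBP-style training. The threshold, decay, and window width are illustrative assumptions, not the values used in the experiments, and the function names are hypothetical.

```python
import numpy as np


def lif_forward(inputs, threshold=0.5, decay=0.2):
    """Simulate a layer of LIF neurons over T discrete time steps.

    inputs: array of shape (T, N) holding the input current of N neurons.
    Returns (spikes, potentials), both of shape (T, N); potentials are
    recorded before the reset. Parameter values are assumptions.
    """
    inputs = np.asarray(inputs, dtype=float)
    steps, n = inputs.shape
    u = np.zeros(n)          # membrane potential
    s = np.zeros(n)          # spikes of the previous step
    spikes = np.zeros((steps, n))
    potentials = np.zeros((steps, n))
    for t in range(steps):
        # Leak, reset the neurons that just fired, then integrate the input.
        u = decay * u * (1.0 - s) + inputs[t]
        s = (u >= threshold).astype(float)
        spikes[t] = s
        potentials[t] = u
    return spikes, potentials


def surrogate_grad(u, threshold=0.5, width=1.0):
    """Rectangular approximation of d(spike)/d(u) around the threshold."""
    return (np.abs(np.asarray(u) - threshold) < width / 2).astype(float) / width
```

In the forward pass the spike is a hard threshold, while in the backward pass its zero-almost-everywhere derivative is replaced by the rectangular window so that the error can be propagated through both space and time.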

Siamese Net
Siamese Net [27] is a classical metric-based few-shot learning algorithm. Because the original Siamese Net cannot handle neuromorphic datasets, we improve it by using an SNN with LIF neurons as the backbone network, which gives the model the ability to process temporal information. The Siamese Net takes two samples as input at the same time. If the sample pair belongs to the same category, it is labeled 1; otherwise, it is labeled 0. The network compares the two samples to determine whether they belong to the same category. As shown in Figure 6 (b), the two samples share the first half of the network, and the difference between the two feature maps is fed into the subsequent fully connected layer. During the test phase, each sample of the query set is compared with the samples of the support set one by one, and the category with the largest probability is output as the classification result.

As shown in the table, the performance of the four methods on N-Omniglot is lower than on Omniglot. The first reason is that the proposed dataset is sparser in the spatial dimension, and the similarity of the data along the temporal dimension is lower than that of inputs based on static images or Poisson coding, which brings a new challenge to SNN learning. Another reason is the lack of preprocessing methods for neuromorphic datasets. The table shows that preprocessing based on event frames performs better than preprocessing based on the firing rate. With the firing rate, because the data are sparse in both the spatial and temporal dimensions, the floating-point values produced by preprocessing vary widely, which is not conducive to the generation and transmission of spikes. Preprocessing methods for spike trains therefore need to meet new requirements.

It is worth noting that we also test the recognition accuracy of the two classical few-shot learning methods at different simulation times. The results show that the longer the simulation time, the lower the accuracy. The reason is that a longer simulation time divides the events into more frames, which makes it more difficult to connect information between frames. This indicates that the data descriptor is essential for improving the ability of SNNs to extract important spatio-temporal features. Therefore, the N-Omniglot dataset proposed in this paper can be considered an effective, robust, and challenging dataset.
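The test phase of the Siamese Net described in this section can be sketched as follows. The sketch assumes that feature vectors have already been produced by the shared network; the use of the absolute difference, the single linear layer with a sigmoid, and the names same_class_probability and classify_query are assumptions made for illustration rather than details taken from the released code.

```python
import numpy as np


def same_class_probability(feat_a, feat_b, weight, bias):
    """Probability that two feature vectors belong to the same class.

    The absolute feature difference is passed through one linear layer
    and a sigmoid (both assumed here for illustration).
    """
    diff = np.abs(np.asarray(feat_a, dtype=float) - np.asarray(feat_b, dtype=float))
    return 1.0 / (1.0 + np.exp(-(diff @ weight + bias)))


def classify_query(query_feat, support_feats, support_labels, weight, bias):
    """Compare one query with every support sample and return the label
    of the support sample with the highest same-class probability."""
    probs = [same_class_probability(query_feat, f, weight, bias)
             for f in support_feats]
    return support_labels[int(np.argmax(probs))], probs
```

In an N-way one-shot task the support set holds one feature vector per class, so the query is scored N times and assigned to the best-matching class.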

Usage Notes
We provide three data interfaces for N-Omniglot to meet the requirements of different algorithms for data loading. In [...] created with the same structure as Omniglot, containing all the samples in NumPy format, so that no more time will be spent here later. Depending on the preprocessing method, data with different numbers of frames can be obtained and fed directly into the neural network. Researchers can directly use the four improved SNN few-shot learning algorithms to process the dataset, or develop new algorithms to preprocess N-Omniglot and propose novel few-shot learning algorithms suitable for neuromorphic datasets.

Table 3. Accuracy of various modified classical methods on N-Omniglot.

Code availability
Preprocessing code for the dataset and the few-shot learning algorithms used to verify its quality can be found here: https://bii.ia.ac.cn/braincog/dataset/N-Omniglot/. The code uses Python 3 and PyTorch, and the Torchvision package version is expected to be higher than 0.8.1. Please refer to the Usage Notes section and the ReadMe file to run the code.